fix: resolve critical relative path issues in packaged build#185
Conversation
- Move node_types data into package (data/ -> src/codeweaver/data/) - Update node_type_parser.py to use importlib.resources - Add pickle cache support for pre-processed node types - Create build scripts: - scripts/build/generate-schema.py (schema generation) - scripts/build/preprocess-node-types.py (cache generation) - scripts/build/prepare-build.py (orchestrator) - Remove problematic save_schema() from settings.py - Fix git version detection in __init__.py (remove invalid cwd) - Update pyproject.toml to include src/codeweaver/data/ This fixes the critical bug where relative paths like Path(__file__).parent.parent.parent.parent broke when the package was flattened during build (src/codeweaver/ becomes /). The node_types are now: 1. Located inside the package at src/codeweaver/data/node_types/ 2. Pre-processed during build and cached as pickle 3. Loaded from cache at runtime via importlib.resources 4. Still support dynamic loading for future user-defined grammars
Creates a new mise task that automates the version release workflow: - Runs prepare-build.py to generate all build artifacts - Updates CHANGELOG.md with git-cliff - Commits build artifacts and changelog - Creates git tag - Optionally pushes to remote Usage: mise run prepare-version # Interactive - prompts for version mise run prepare-version 1.2.0 # Specify version mise run prepare-version 1.2.0 --push # Auto-push after tagging mise run prepare-version --no-commit # Just generate artifacts mise run prepare-version --no-tag # Commit but don't tag This consolidates the release preparation process into a single command, ensuring all build artifacts are generated correctly before each release.
Reviewer's GuideRefactors how node type data and schemas are packaged and loaded so builds no longer depend on fragile relative paths, adds a precomputed node-types cache for faster startup, and introduces scripted build + release preparation (including schema generation) wired into a new mise prepare-version task. Sequence diagram for node type parser initialization and loading with cachesequenceDiagram
actor Runtime
participant NodeTypeParser
participant NodeTypeFileLoader
participant PackageData as PackageResources_codeweaver_data
Runtime->>NodeTypeParser: __init__(languages, use_cache)
NodeTypeParser->>NodeTypeParser: set _languages
NodeTypeParser->>NodeTypeFileLoader: create NodeTypeFileLoader(directory=None)
NodeTypeParser->>NodeTypeParser: set _use_cache
alt use_cache is true and _cache_loaded is false
NodeTypeParser->>NodeTypeParser: _load_from_cache()
NodeTypeParser->>PackageData: files("codeweaver.data")/"node_types_cache.pkl"
alt cache file exists and is valid
PackageData-->>NodeTypeParser: cache bytes
NodeTypeParser->>NodeTypeParser: pickle.loads(bytes)
NodeTypeParser->>NodeTypeParser: set _registration_cache
NodeTypeParser->>NodeTypeParser: set _cache_loaded True
else cache missing or invalid
PackageData-->>NodeTypeParser: error or missing
NodeTypeParser->>NodeTypeParser: log warning and fall back
end
else use_cache is false or _cache_loaded already true
NodeTypeParser-->>Runtime: skip cache loading
end
Note over Runtime,NodeTypeParser: Later, when parsing without cache file
Runtime->>NodeTypeParser: parse_all_nodes()
NodeTypeParser->>NodeTypeFileLoader: load_all_files()
NodeTypeFileLoader->>PackageData: files("codeweaver.data")/"node_types"
PackageData-->>NodeTypeFileLoader: directory listing of *"node-types.json"
NodeTypeFileLoader-->>NodeTypeParser: list of Paths
NodeTypeParser->>NodeTypeParser: populate _registration_cache from JSON
Class diagram for updated node type loading and cachingclassDiagram
class SemanticSearchLanguage {
}
class NodeTypeFileLoader {
+DirectoryPath~None directory
+list~Path~ files
+list~NodeArray~ _nodes
+NodeTypeFileLoader(directory DirectoryPath~None, files list~Path~None)
+_load_file(language SemanticSearchLanguage) list~dict~str, Any~~None
}
class NodeTypeParser {
+frozenset~SemanticSearchLanguage~ _languages
+NodeTypeFileLoader _loader
+bool _use_cache
+dict _registration_cache
+bool _cache_loaded
+NodeTypeParser(languages Sequence~SemanticSearchLanguage~None, use_cache bool)
+_load_from_cache() bool
+parse_all_nodes() list
+nodes list~NodeArray~
}
NodeTypeParser *-- NodeTypeFileLoader : uses
NodeTypeParser ..> SemanticSearchLanguage : registers
NodeTypeFileLoader ..> SemanticSearchLanguage : loads files for
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
✅ CLA Check PassedAll contributors to this PR are exempt from the CLA requirement:
Knitli organization members and recognized bots do not need to sign the CLA. Thanks for your contribution to codeweaver! 🧵 |
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)
General comments:
- In
preprocess-node-types.pyyou're reaching into the privateNodeTypeParser._registration_cache; consider exposing a public method or property to retrieve the registration cache so the build script isn't coupled to an internal attribute that may change. - The pickle written by
preprocess-node-types.pystores boththingsandregistration_cache, but_load_from_cacheonly consumesregistration_cache; either load and reuse the precomputedthingsas well or drop them from the cache payload to avoid unnecessary file size and confusion about what is actually used. - In
get_version()the code now silently falls back to"0.0.0"whengit describefails; consider logging a debug or warning message in that branch so unexpected version resolution failures are visible during debugging.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `preprocess-node-types.py` you're reaching into the private `NodeTypeParser._registration_cache`; consider exposing a public method or property to retrieve the registration cache so the build script isn't coupled to an internal attribute that may change.
- The pickle written by `preprocess-node-types.py` stores both `things` and `registration_cache`, but `_load_from_cache` only consumes `registration_cache`; either load and reuse the precomputed `things` as well or drop them from the cache payload to avoid unnecessary file size and confusion about what is actually used.
- In `get_version()` the code now silently falls back to `"0.0.0"` when `git describe` fails; consider logging a debug or warning message in that branch so unexpected version resolution failures are visible during debugging.
## Individual Comments
### Comment 1
<location> `scripts/build/preprocess-node-types.py:36-42` </location>
<code_context>
+ print(f" Parsed {len(all_things)} Things/Categories across all languages")
+
+ # Get the cache from the parser's internal registry
+ cache_data = {
+ "things": all_things,
+ "registration_cache": parser._registration_cache,
+ }
+
</code_context>
<issue_to_address>
**suggestion (performance):** Consider whether storing `things` in the cache is necessary given the current load path.
Right now the cache includes both `"things"` and `"registration_cache"`, but `_load_from_cache` only reads `registration_cache`. If `things` isn’t used elsewhere, you can remove it from the cache to shrink the pickle and avoid serializing large data unnecessarily.
If you expect to load `things` from the cache later, consider adding a comment or assertion in `_load_from_cache` to document that contract and keep the cache shape intentional.
```suggestion
print(f" Parsed {len(all_things)} Things/Categories across all languages")
# Only persist the parser's registration cache. The Thing/Category objects
# are reconstructed from source at runtime and are not needed in the cache.
cache_data = {
"registration_cache": parser._registration_cache,
}
```
</issue_to_address>
### Comment 2
<location> `src/codeweaver/__init__.py:54-59` </location>
<code_context>
git_describe = subprocess.run(
[git, "describe", "--tags", "--always", "--dirty"],
capture_output=True,
text=True,
check=False,
)
</code_context>
<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
*Source: opengrep*
</issue_to_address>
### Comment 3
<location> `src/codeweaver/semantic/node_type_parser.py:466-469` </location>
<code_context>
def _load_file(self, language: SemanticSearchLanguage) -> list[dict[str, Any]] | None:
"""Load a single node types file for a specific language."""
if language == SemanticSearchLanguage.JSX:
language = SemanticSearchLanguage.JAVASCRIPT
# If using package resources, load directly
if self.directory is None:
data_dir = files("codeweaver.data") / "node_types"
filename = f"{language.value}-node-types.json"
resource = data_dir / filename
if resource.is_file():
return from_json(resource.read_bytes())
return None
# Otherwise use file path
file_path = next(
(
file
for file in self.files
if SemanticSearchLanguage.from_string(file.stem.replace("-node-types", ""))
== language
),
None,
)
if file_path and file_path.exists():
return from_json(file_path.read_bytes())
return None
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Replace if statement with if expression ([`assign-if-exp`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/assign-if-exp/))
```suggestion
return from_json(resource.read_bytes()) if resource.is_file() else None
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| git_describe = subprocess.run( | ||
| ["describe", "--tags", "--always", "--dirty"], # noqa: S607 | ||
| executable=git, | ||
| [git, "describe", "--tags", "--always", "--dirty"], | ||
| capture_output=True, | ||
| text=True, | ||
| check=True, | ||
| cwd=str(Path(__file__).parent.parent.parent), | ||
| check=False, | ||
| ) |
There was a problem hiding this comment.
security (python.lang.security.audit.dangerous-subprocess-use-audit): Detected subprocess function 'run' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.
Source: opengrep
There was a problem hiding this comment.
Pull request overview
This pull request fixes critical relative path issues that broke the package when installed via pip/uv by moving data files into the package structure and using importlib.resources for resource loading. Key improvements include adding a pickle cache for performance and implementing proper resource access patterns.
Key Changes
- Replaced hardcoded relative paths with
importlib.resources.files()for robust package resource access - Added pickle cache support for pre-processed node types (performance optimization)
- Made directory parameter optional in
NodeTypeFileLoader(defaults to package resources)
Reviewed changes
Copilot reviewed 9 out of 43 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/codeweaver/semantic/node_type_parser.py | Updated to use importlib.resources for loading node types, added pickle cache support, made directory parameter optional |
| src/codeweaver/data/node_types/*.json | Added node type definition files for yaml, typescript, solidity, and json to package data |
| src/codeweaver/data/node_types/analysis/*.license | Added license files for analysis data |
| src/codeweaver/data/init.py | Added package init file for data directory |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # If using package resources, load directly | ||
| if self.directory is None: | ||
| data_dir = files("codeweaver.data") / "node_types" | ||
| filename = f"{language.value}-node-types.json" | ||
| resource = data_dir / filename | ||
| if resource.is_file(): | ||
| return from_json(resource.read_bytes()) | ||
| return None |
There was a problem hiding this comment.
The _load_file method has inconsistent return behavior. When using package resources (line 467), it returns from_json(resource.read_bytes()) or None, but when using file paths (line 471), it returns the result of from_json() which may not be None for missing files. The None return should be explicitly documented in the docstring, and the return type annotation should be updated to reflect that this method can return None.
| def __init__( | ||
| self, languages: Sequence[SemanticSearchLanguage] | None = None, use_cache: bool = True | ||
| ) -> None: | ||
| """Initialize NodeTypeParser with an optional NodeTypeFileLoader. | ||
|
|
||
| Args: | ||
| languages: Optional pre-loaded list of languages to parse; if None, will load all available languages. | ||
| use_cache: Whether to try loading from the pre-built cache. Default True. | ||
| """ |
There was a problem hiding this comment.
The use_cache parameter should be documented in the class docstring, not just in the __init__ method. Additionally, consider documenting what happens when the cache fails to load (falls back to parsing from JSON files).
There was a problem hiding this comment.
Code Review Summary
This PR successfully addresses the critical packaging bug and introduces excellent build infrastructure. However, there are two important issues that should be fixed before merging.
🚨 Key Issues
1. Remove Unused Cache Data (scripts/build/preprocess-node-types.py:39-42)
Sourcery correctly identified that "things" is stored in the cache but never loaded at runtime (see node_type_parser.py:584). This wastes disk space and creates confusion.
Recommendation: Remove it from the cache:
# Only persist the parser's registration cache
cache_data = {
"registration_cache": parser._registration_cache,
}2. Decouple Build Script from Private Attributes
The build script accesses parser._registration_cache (private attribute). Consider adding a public property to NodeTypeParser:
@property
def registration_cache(self) -> dict[SemanticSearchLanguage, _ThingCacheDict]:
"""Get the registration cache for serialization."""
return self._registration_cache🛡️ Security Review
The subprocess call flagged by Sourcery in src/codeweaver/__init__.py:54 is a false positive. The git path from shutil.which() is OS-controlled, not user input, so there's no command injection risk. You can safely ignore this warning or add a clarifying comment.
✅ What's Good
- Excellent problem identification and solution approach
- Clean build infrastructure with proper orchestration
- Performance improvements via pickle caching
- Follows Python packaging best practices
- Good error handling and user feedback
📝 Minor Suggestions
- Add debug logging when git version detection fails (__init__.py:60-65)
- Consider Sourcery's style suggestion for node_type_parser.py:466-469 (minor readability improvement)
🧪 Testing
Wait for the test suite (Python 3.12 & 3.13) to complete before merging.
Recommendation
Request Changes for the two issues above. Once fixed, this will be ready to merge! Great work on solving a complex packaging problem. 🚀
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com> Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
|
👋 Hey @bashandbone, Thanks for your contribution to codeweaver! 🧵You need to agree to the CLA first... 🖊️Before we can accept your contribution, you need to agree to our Contributor License Agreement (CLA). To agree to the CLA, please comment:
Those exact words are important1, so please don't change them. 😉 You can read the full CLA here: Contributor License Agreement ✅ @bashandbone has signed the CLA. 0 out of 2 committers have signed the CLA. Footnotes
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 43 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.debug("Node types cache not found, will parse from JSON files") | ||
| return False | ||
|
|
||
| cache_data = pickle.loads(cache_resource.read_bytes()) |
There was a problem hiding this comment.
The pickle.loads() call on line 581 loads untrusted data from package resources without any validation. While this is from package resources and presumably safe during development, a compromised package or build process could inject malicious pickle data. Consider:
- Adding integrity checks (e.g., HMAC signature verification)
- Using a safer serialization format like JSON
- At minimum, wrapping in a try-except with specific exception handling
This is especially important since the cache is loaded by default (use_cache=True).
| except Exception as e: | ||
| logger.warning("Failed to load node types cache, will parse from JSON: %s", e) | ||
| return False |
There was a problem hiding this comment.
The exception handling on line 587 is too broad. Catching all exceptions with except Exception as e may hide important errors like import errors, attribute errors, or corruption issues. Consider:
- Catching specific exceptions (e.g.,
pickle.UnpicklingError,FileNotFoundError) - Re-raising critical errors that shouldn't be silently ignored
- Using different log levels for different error types (e.g.,
logger.errorfor unexpected failures vslogger.debugfor missing cache)
- Remove unused 'things' from cache data (only registration_cache needed) - Add public registration_cache property to NodeTypeParser - Avoid accessing private attributes from build scripts This reduces cache file size and follows better encapsulation practices.
- Add validation checks for cache structure and data types - Replace broad exception handler with specific exception types: - pickle.UnpicklingError, AttributeError, KeyError (data corruption) - FileNotFoundError, OSError (filesystem errors) - Add security documentation explaining pickle trust model - Provide more specific error messages for debugging This makes cache loading more robust and prevents silent failures from masking important errors like import or corruption issues.
There was a problem hiding this comment.
Copilot reviewed 9 out of 43 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem: Relative paths like Path(file).parent.parent.parent.parent "data" broke when package was flattened during build.
Solution:
✅ Moved data/node_types/ into package → src/codeweaver/data/node_types/
✅ Updated code to use importlib.resources.files("codeweaver.data")
✅ Added pickle cache for pre-processed node-types (performance boost)
✅ Created comprehensive build infrastructure:
scripts/build/prepare-build.py - Master orchestrator
scripts/build/preprocess-node-types.py - Generate cache
scripts/build/generate-schema.py - Generate schema
✅ Added mise prepare-version task for automated releases
Impact:
🐛 Fixed: Package works correctly when installed via pip/uv
⚡ Performance: Runtime loads from pickle cache (much faster)
🛠️ Maintainable: Automated build process
🔮 Future-proof: Kept dynamic loading for user-defined grammars
Files Changed: 42 files (+298, -32)
Branch: claude/fix-relative-paths-01982YFFb4V8714SNxCMGiqq
Ready to merge after review! 🚀
Summary by Sourcery
Fix packaged builds that relied on repo-relative paths by moving node type data into the package, adding a precomputed cache, and introducing reproducible build-time generation of schemas and artifacts.
New Features:
Bug Fixes:
Enhancements:
Build: